Tree-based Methods


Advantages and Disadvantages

  • Tree-based methods are simple and useful for interpretation

  • However, they typically are not competitive with the best supervised learning approaches in terms of prediction accuracy

  • Hence we also discuss bagging, random forests, and boosting. These methods grow multiple trees which are then combined to yield a single consensus prediction

  • Combining a large number of trees can often result in dramatic improvements in prediction accuracy, at the expense of some loss in interpretation


The Basics of Decision Trees





Terminology for Trees


Interpretation of Results


Details of the Tree-Building Process


Predictions



Pruning a Tree


Choosing the Best Subtree


Summary: Tree Algorithm


Classification Trees


Gini Index and Deviance


Advantages and Disadvantages of Trees


Bagging


Bagging Classification Trees


Out-of-Bag Error Estimation


Random Forests


Boosting


Boosting Algorithm for Regression Trees

  • Set \(\hat{f}(x) = 0\) and \(r_i = y_i\) for all \(i\) in the training set

  • For \(b = 1, 2, ..., B\) repeat:

    • Fit a tree \(\hat{f}^b\) with \(d\) splits (\(d + 1\) terminal nodes) to the training data \((X, r)\)

    • Update \(\hat{f}\) by adding in a shrunken version of the new tree:

      • \(\hat{f}(x) \leftarrow \hat{f}(x) + \lambda \hat{f}^b(x)\)
    • Update the residuals

      • \(r_i \leftarrow r_i - \lambda \hat{f}^b(x_i)\)
  • Output the boosted model

    • \(\hat{f}(x) = \sum_{b=1}^{B} \lambda \hat{f}^b(x)\)
  • Unlike fitting a single large decision tree to the data, which amounts to fitting the data hard and potentially overfitting, the boosting approach instead learns slowly

  • Given the current model, we fit a decision tree to the residuals from the model. We then add this new decision tree into the fitted function in order to update the residuals

  • Each of these trees can be rather small, with just a few terminal nodes, determined by the parameter \(d\) in the algorithm

  • By fitting small trees to the residuals, we slowly improve \(\hat{f}\) in areas where it does not perform well. The shrinkage parameter \(\lambda\) slows the process down even further, allowing more and different shaped trees to attack the residuals (see the sketch after this list)
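The following is a minimal sketch of the boosting loop above, using scikit-learn's DecisionTreeRegressor as the base learner. The simulated data, the choices of \(B\), \(d\), and \(\lambda\), and the helper function f_hat are illustrative assumptions, not part of the original algorithm statement.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    # Illustrative data: y = sin(x) plus noise
    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(200, 1))
    y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)

    B, d, lam = 100, 1, 0.01       # number of trees, splits per tree, shrinkage lambda
    trees = []
    r = y.copy()                   # start with f_hat(x) = 0, so residuals r_i = y_i

    for b in range(B):
        tree = DecisionTreeRegressor(max_leaf_nodes=d + 1)  # d splits -> d + 1 terminal nodes
        tree.fit(X, r)             # fit the tree to the current residuals
        trees.append(tree)
        r -= lam * tree.predict(X) # shrink the new tree's fit and update the residuals

    def f_hat(X_new):
        """Boosted model: sum of the B shrunken trees."""
        return lam * sum(t.predict(X_new) for t in trees)

    print("training MSE:", np.mean((y - f_hat(X)) ** 2))

With \(d = 1\) each tree is a stump, so the boosted model is additive in the individual predictors; larger \(d\) allows interactions between up to \(d\) variables.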


Boosting for Classification


Tuning Parameters for Boosting


Summary